Clustering and classification

Data description

For this week analysis I use Boston dataset from the MASS package. It contains housing values in suburbs of Boston and consists of 506 observations of the following 14 variables:

  1. crim – per capita crime rate by town
  2. zn – proportion of residential land zoned for lots over 25,000 sq.ft.
  3. indus – proportion of non-retail business acres per town
  4. chas – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. nox – nitrogen oxides concentration (parts per 10 million)
  6. rm – average number of rooms per dwelling
  7. age – proportion of owner-occupied units built prior to 1940
  8. dis – weighted mean of distances to five Boston employment centres
  9. rad – index of accessibility to radial highways
  10. tax – full-value property-tax rate per $10,000
  11. ptratio – pupil-teacher ratio by town
  12. black – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. lstat – lower status of the population (percent)
  14. medv – median value of owner-occupied homes in $1000s

Exploratory analysis

First, let’s explore the dataset a bit:

library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Let’s have a closer look at variables and their distributions.

## Loading required package: ggplot2
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

We can check relationship between variables using a separate correlation plot:

## ── Attaching packages ─────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble  2.1.1       ✔ purrr   0.3.2  
## ✔ tidyr   0.8.3       ✔ dplyr   0.8.0.1
## ✔ readr   1.3.1       ✔ stringr 1.4.0  
## ✔ tibble  2.1.1       ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## corrplot 0.84 loaded
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

The highest correlations are observed between:

  • proportion of non-retail business acres (indus) and distances to Boston employment centresdis (dis) are strongly negatively correlated: -0.71
  • nitrogen oxides concentration (nox) and distances to Boston employment centresdis (dis) are strongly negatively correlated: -0.77
  • proportion of owner-occupied units built prior to 1940 (age) and distances to Boston employment centresdis (dis) are strongly negatively correlated: -0.75
  • lower status of the population (lstat) and median value of owner-occupied homes (medv) are strongly negatively correlated: -0.74
  • index of accessibility to radial highways (rad) and full-value property-tax rate (tax) are strongly positively correlated: 0.91
  • index of accessibility to radial highways (rad) and full-value property-tax rate (tax) are strongly positively correlated: 0.91
  • proportion of non-retail business acres (indus) and nitrogen oxides concentration (nox) are strongly positively correlated: 0.76
  • proportion of non-retail business acres (indus) and full-value property-tax rate (tax) are strongly positively correlated: 0.72
  • nitrogen oxides concentration (nox) and proportion of owner-occupied units built prior to 1940 (age) are strongly positively correlated: 0.73
  • owner-occupied homes (medv) and average number of rooms per dwelling (rm) are strongly positively correlated: 0.70

Data wrangling

Since my variables are measured on different scales, for further analysis I have to scale the data, subtracting the column means from the corresponding columns and dividing the difference with standard deviation.

boston_scaled <- scale(Boston)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)

As it can be seen, all the variables’ means are zeros now after scaling.

I also create a factor variable for numerical crim, categorizing it into high, low and middle rates of crime.

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

I remove the initial variable crim and add the new categorical one.

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

I split my data into test (20%) and train (80%) sets in order to assess the model’s (which I’m going to build) quality. The training of the model will be done with the train set and prediction on new data is done with the test set.

n <- nrow(boston_scaled)
set.seed(12345)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Linear discriminant analysis (LDA)

lda_model <- lda(crime~., data=train)
lda_model
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2475248 0.2574257 0.2475248 0.2475248 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.97229754 -0.9597872 -0.154216061 -0.8889207  0.4570563
## med_low  -0.08464861 -0.3244885 -0.007331936 -0.5892641 -0.1082762
## med_high -0.39119540  0.1704608  0.200122961  0.3770782  0.1375651
## high     -0.48724019  1.0171519 -0.075474056  1.0547756 -0.5001506
##                 age        dis        rad        tax     ptratio
## low      -0.8780782  0.9035281 -0.6913090 -0.7279533 -0.39471367
## med_low  -0.3707589  0.3894044 -0.5434660 -0.5306788 -0.04341686
## med_high  0.4098247 -0.3529031 -0.4041923 -0.3146912 -0.31203261
## high      0.8165907 -0.8472627  1.6377820  1.5138081  0.78037363
##               black      lstat        medv
## low       0.3857136 -0.7891212  0.51279116
## med_low   0.3160280 -0.1608905  0.01640163
## med_high  0.0563347 -0.0252292  0.17942536
## high     -0.6736089  0.9100659 -0.66442722
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.06453072  0.74240382 -0.88262071
## indus    0.06465153 -0.34822618  0.43060847
## chas    -0.07437636 -0.09918695  0.12926793
## nox      0.32119190 -0.73282709 -1.30214445
## rm      -0.12621343 -0.11008660 -0.18940456
## age      0.24857359 -0.28701321 -0.33705453
## dis     -0.06379877 -0.31544376  0.10788004
## rad      3.04902524  0.98466813  0.09844024
## tax      0.12625765 -0.05144994  0.20511932
## ptratio  0.09254812  0.02946943 -0.18922812
## black   -0.09973884  0.03443536  0.12859239
## lstat    0.21995230 -0.17129701  0.51685693
## medv     0.22046620 -0.39973922 -0.12354310
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9459 0.0403 0.0139

Prior probabilities are just equal proportions of four groups (1/4). Coefficients mean that the first discriminant function (LD1) is a linear combination of the variables: \(0.065∗zn+0.065∗indus⋯+0.22∗medv\). Proportion of trace is the between group variance. Linear discriminant 1 explains almost 95% of between group variance.

Let’s draw the LDA biplot:

The most influential linear separators for the clusters are rad, zn and nox. I save the correct classes from the test data set, and then remove them from the data frame itself, since I’m going to test my model on it. So this information about correct classification must not be there.

correct_classes <- test$crime
test <- dplyr::select(test, -crime)

Now I will make predictions based on a model:

lda.pred <- predict(lda_model, newdata = test)

And check the quality of prediction with cross-tabulation:

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14      12        1    0
##   med_low    1      17        4    0
##   med_high   0       7       18    1
##   high       0       0        0   27

It can be seen that the model successfully predicts low and high, but fails with middle rates of crime, since they are probably less separable from each other. It also can be clearly visible from the biplot, that green (med_high) and red (med_low) severely clash.

k-means

boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)

For calculating between the observations I will use the most common Euclidean method.

dist_eu <- dist(boston_scaled2, method = "euclidean")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Now I implement k-means algorithm. To determine the optimal number of clusters let’s look at the total of within cluster sum of squares (WCSS) The optimal number of clusters is when the total WCSS drops radically, thus it is 2.

set.seed(123)
# max number of clusters
k_max <- 10
# the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')

km <- kmeans(boston_scaled2, centers = 2)
## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero

## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero

## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero

## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero

Looking at the plot we can see, that in many variables classes are separable indeed, especially when looking at distributions and correlation coefficients (which vary for two classes). Among the most visible and distinguishable differences between classes are:

  • indus
  • nox
  • age
  • dis
  • rad
  • tax
  • ptratio

Bonus

Let’s perform k-means with 2 clusters.

km_new <- kmeans(boston_scaled2, centers = 3)

I perform LDA using the clusters as target classes.

new_data <- dplyr::select(boston_scaled2, -crim)
new_data <- data.frame(new_data, km_new$cluster)
set.seed(12345)
train_new <- new_data[ind,]
test_new <- new_data[-ind,]
lda_model_new <- lda(km_new.cluster~., data=train_new)
lda_model_new
## Call:
## lda(km_new.cluster ~ ., data = train_new)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2896040 0.4331683 0.2772277 
## 
## Group means:
##           zn      indus        chas        nox         rm        age
## 1 -0.4872402  1.0650531 -0.03677606  1.1318802 -0.5060351  0.7835135
## 2 -0.3992808 -0.1386820  0.08763438 -0.1782981 -0.1640318  0.1855020
## 3  1.1380713 -0.9938046 -0.13171834 -0.9662319  0.7687324 -1.1415994
##           dis        rad        tax     ptratio      black       lstat
## 1 -0.84490528  1.4132224  1.3907131  0.61746316 -0.6404477  0.91465920
## 2 -0.06275337 -0.5802359 -0.5441956 -0.04043224  0.2486883 -0.05609896
## 3  1.07741104 -0.5901619 -0.6745844 -0.55643005  0.3671677 -0.93177559
##          medv
## 1 -0.67072240
## 2 -0.06706528
## 3  0.84549682
## 
## Coefficients of linear discriminants:
##                  LD1         LD2
## zn       0.043309481  0.83222632
## indus   -0.272397029  0.01729030
## chas     0.004084862 -0.17907457
## nox     -0.795442248  0.46890783
## rm       0.110569407  0.39122239
## age      0.082133532 -0.93047496
## dis      0.051331466  0.31679703
## rad     -1.526920272  0.77083066
## tax     -0.858771883  0.18114143
## ptratio -0.055424237  0.01223941
## black    0.042493264 -0.02976446
## lstat   -0.376381397  0.43131410
## medv     0.004158310  0.53794043
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8753 0.1247

Coefficients mean that the first discriminant function (LD1) is a linear combination of the variables: \(0.043∗zn-0.27∗indus⋯+0.004∗medv\) etc. Let’s plot:

classes_new <- as.numeric(train_new$km_new.cluster)
plot(lda_model_new, dimen = 2, col = classes_new, pch = classes_new)
lda.arrows(lda_model_new, myscale = 1)

This time the most influential linear separators for the clusters are rad, age and zn.

correct_classes_new <- test_new$km_new.cluster
test_new <- dplyr::select(test_new, -km_new.cluster)

Super bonus

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda_model$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda_model$scaling
matrix_product <- as.data.frame(matrix_product)

Plotly graph:

## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
km_cluster <- as.data.frame(km$cluster)
km_set <- km_cluster[ind,]

Firstly, the number of groups is different, since for k-means I set two clusters only. But what we can see is that one cluster distinguishes a lot, and the rest of observations are less separable.